Dataset: Hotel booking demand

Mateusz Krzyziński, Anna Urbala, Artur Żółkowski

Wstępne informacje o zbiorze

Zbiór pochodzi z publikacji w czasopismie Data in Brief.

Tak naprawdę są to dwa zbiory połączone w jeden. Dane dotyczą bowiem dwóch hoteli ulokowanych w Portugalii: jednego miejskiego (w Lizbonie), drugiego w typie kurortu (resort hotel, w regionie Algarve).

Zbiór zawiera informacje o rezerwacjach w obu hotelach w okresie od 1 lipca 2015 do 31 sierpnia 2017 roku.

Wczytanie pakietów i danych, pierwsze spostrzeżenia

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

pd.options.mode.chained_assignment = None 
In [3]:
df = pd.read_csv("data/hotel_bookings.csv")
In [9]:
pd.DataFrame(df.groupby("country")["is_canceled"].mean().sort_values()).head(20)
Out[9]:
is_canceled
country
ABW 0.0
KIR 0.0
SMR 0.0
JAM 0.0
IRQ 0.0
STP 0.0
SUR 0.0
KNA 0.0
PYF 0.0
SYR 0.0
TGO 0.0
DMA 0.0
DJI 0.0
CYM 0.0
CUB 0.0
GTM 0.0
LAO 0.0
LBY 0.0
LCA 0.0
PLW 0.0
In [ ]:
df[df["reserved_room_type"] == df["assigned_room_type"]]
In [26]:
df.loc[df["days_in_waiting_list"] > 0, "days_in_waiting_list"].hist(bins = 100)
Out[26]:
<AxesSubplot:>
In [36]:
df.groupby(["days_in_waiting_list"])["is_canceled"].mean()
Out[36]:
days_in_waiting_list
0      0.361866
1      0.250000
2      0.200000
3      1.000000
4      0.320000
         ...   
236    0.171429
259    0.000000
330    0.066667
379    0.600000
391    1.000000
Name: is_canceled, Length: 128, dtype: float64
In [33]:
df.groupby(["assigned_room_type"])["is_canceled"].mean()
Out[33]:
assigned_room_type
A    0.444925
B    0.236708
C    0.187789
D    0.251244
E    0.252114
F    0.247134
G    0.305523
H    0.352528
I    0.013774
K    0.043011
L    1.000000
P    1.000000
Name: is_canceled, dtype: float64
In [3]:
# Spójrzmy, jak to wygląda
df
Out[3]:
hotel is_canceled lead_time arrival_date_year arrival_date_month arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights adults ... deposit_type agent company days_in_waiting_list customer_type adr required_car_parking_spaces total_of_special_requests reservation_status reservation_status_date
0 Resort Hotel 0 342 2015 July 27 1 0 0 2 ... No Deposit NaN NaN 0 Transient 0.00 0 0 Check-Out 2015-07-01
1 Resort Hotel 0 737 2015 July 27 1 0 0 2 ... No Deposit NaN NaN 0 Transient 0.00 0 0 Check-Out 2015-07-01
2 Resort Hotel 0 7 2015 July 27 1 0 1 1 ... No Deposit NaN NaN 0 Transient 75.00 0 0 Check-Out 2015-07-02
3 Resort Hotel 0 13 2015 July 27 1 0 1 1 ... No Deposit 304.0 NaN 0 Transient 75.00 0 0 Check-Out 2015-07-02
4 Resort Hotel 0 14 2015 July 27 1 0 2 2 ... No Deposit 240.0 NaN 0 Transient 98.00 0 1 Check-Out 2015-07-03
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
119385 City Hotel 0 23 2017 August 35 30 2 5 2 ... No Deposit 394.0 NaN 0 Transient 96.14 0 0 Check-Out 2017-09-06
119386 City Hotel 0 102 2017 August 35 31 2 5 3 ... No Deposit 9.0 NaN 0 Transient 225.43 0 2 Check-Out 2017-09-07
119387 City Hotel 0 34 2017 August 35 31 2 5 2 ... No Deposit 9.0 NaN 0 Transient 157.71 0 4 Check-Out 2017-09-07
119388 City Hotel 0 109 2017 August 35 31 2 5 2 ... No Deposit 89.0 NaN 0 Transient 104.40 0 0 Check-Out 2017-09-07
119389 City Hotel 0 205 2017 August 35 29 2 7 2 ... No Deposit 9.0 NaN 0 Transient 151.20 0 2 Check-Out 2017-09-07

119390 rows × 32 columns

In [4]:
df.describe()
Out[4]:
is_canceled lead_time arrival_date_year arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights adults children babies is_repeated_guest previous_cancellations previous_bookings_not_canceled booking_changes agent company days_in_waiting_list adr required_car_parking_spaces total_of_special_requests
count 119390.000000 119390.000000 119390.000000 119390.000000 119390.000000 119390.000000 119390.000000 119390.000000 119386.000000 119390.000000 119390.000000 119390.000000 119390.000000 119390.000000 103050.000000 6797.000000 119390.000000 119390.000000 119390.000000 119390.000000
mean 0.370416 104.011416 2016.156554 27.165173 15.798241 0.927599 2.500302 1.856403 0.103890 0.007949 0.031912 0.087118 0.137097 0.221124 86.693382 189.266735 2.321149 101.831122 0.062518 0.571363
std 0.482918 106.863097 0.707476 13.605138 8.780829 0.998613 1.908286 0.579261 0.398561 0.097436 0.175767 0.844336 1.497437 0.652306 110.774548 131.655015 17.594721 50.535790 0.245291 0.792798
min 0.000000 0.000000 2015.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 6.000000 0.000000 -6.380000 0.000000 0.000000
25% 0.000000 18.000000 2016.000000 16.000000 8.000000 0.000000 1.000000 2.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 9.000000 62.000000 0.000000 69.290000 0.000000 0.000000
50% 0.000000 69.000000 2016.000000 28.000000 16.000000 1.000000 2.000000 2.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 14.000000 179.000000 0.000000 94.575000 0.000000 0.000000
75% 1.000000 160.000000 2017.000000 38.000000 23.000000 2.000000 3.000000 2.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 229.000000 270.000000 0.000000 126.000000 0.000000 1.000000
max 1.000000 737.000000 2017.000000 53.000000 31.000000 19.000000 50.000000 55.000000 10.000000 10.000000 1.000000 26.000000 72.000000 21.000000 535.000000 543.000000 391.000000 5400.000000 8.000000 5.000000
In [5]:
# Ile jest wierszy, jakie sÄ… kolumny, gdzie siedzÄ… nulle? 
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 32 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   hotel                           119390 non-null  object 
 1   is_canceled                     119390 non-null  int64  
 2   lead_time                       119390 non-null  int64  
 3   arrival_date_year               119390 non-null  int64  
 4   arrival_date_month              119390 non-null  object 
 5   arrival_date_week_number        119390 non-null  int64  
 6   arrival_date_day_of_month       119390 non-null  int64  
 7   stays_in_weekend_nights         119390 non-null  int64  
 8   stays_in_week_nights            119390 non-null  int64  
 9   adults                          119390 non-null  int64  
 10  children                        119386 non-null  float64
 11  babies                          119390 non-null  int64  
 12  meal                            119390 non-null  object 
 13  country                         118902 non-null  object 
 14  market_segment                  119390 non-null  object 
 15  distribution_channel            119390 non-null  object 
 16  is_repeated_guest               119390 non-null  int64  
 17  previous_cancellations          119390 non-null  int64  
 18  previous_bookings_not_canceled  119390 non-null  int64  
 19  reserved_room_type              119390 non-null  object 
 20  assigned_room_type              119390 non-null  object 
 21  booking_changes                 119390 non-null  int64  
 22  deposit_type                    119390 non-null  object 
 23  agent                           103050 non-null  float64
 24  company                         6797 non-null    float64
 25  days_in_waiting_list            119390 non-null  int64  
 26  customer_type                   119390 non-null  object 
 27  adr                             119390 non-null  float64
 28  required_car_parking_spaces     119390 non-null  int64  
 29  total_of_special_requests       119390 non-null  int64  
 30  reservation_status              119390 non-null  object 
 31  reservation_status_date         119390 non-null  object 
dtypes: float64(4), int64(16), object(12)
memory usage: 29.1+ MB

W większości nazwy kolumn są samotłumaczące się, a te mniej oczywiste można sprawdzić we wspomnianej pracy. Ja raczej lubię wiedzieć, z czym pracuję, więc na przykład:

  • lead_time to liczba dni od wprowadzenia rezerwacji do systemu do dnia rozpoczÄ™cia pobytu,
  • distribution_channel i market_segment sÄ… zwiÄ…zane ze sposobem rezerwacji (market_segment specyfikuje dokÅ‚adniej distribution_channel),
  • agent to ID agencji podróży, przez którÄ… odbyÅ‚a siÄ™ rezerwacja,
  • company to ID firmy/jednostki, która zarezerwowaÅ‚a pobyt lub jest odpowieddzialna za pÅ‚atność,
  • adr to Average Daily Rate = przychód z rezerwacji / liczba zarezerwowanych pokoi.
In [6]:
# Czym się różni distribution_channel i market_segment? 
# market_segment to doprecyzowanie distribution_channel 
df[df["distribution_channel"] != df["market_segment"]][["distribution_channel","market_segment"]]
Out[6]:
distribution_channel market_segment
4 TA/TO Online TA
5 TA/TO Online TA
8 TA/TO Online TA
9 TA/TO Offline TA/TO
10 TA/TO Online TA
... ... ...
119385 TA/TO Offline TA/TO
119386 TA/TO Online TA
119387 TA/TO Online TA
119388 TA/TO Online TA
119389 TA/TO Online TA

102324 rows × 2 columns

In [7]:
df[df["distribution_channel"] == df["market_segment"]][["distribution_channel","market_segment"]]
Out[7]:
distribution_channel market_segment
0 Direct Direct
1 Direct Direct
2 Direct Direct
3 Corporate Corporate
6 Direct Direct
... ... ...
119358 Direct Direct
119365 Direct Direct
119368 Direct Direct
119369 Direct Direct
119381 Direct Direct

17066 rows × 2 columns

In [8]:
df["distribution_channel"].value_counts()
Out[8]:
TA/TO        97870
Direct       14645
Corporate     6677
GDS            193
Undefined        5
Name: distribution_channel, dtype: int64

Trzeba wyrzucić reservation_status, bo dubluje informacje o is_canceled.

In [123]:
df["reservation_status"].value_counts()
Out[123]:
Check-Out    75166
Canceled     43017
No-Show       1207
Name: reservation_status, dtype: int64
In [9]:
#Jak dużo mamy nulli?
df.isna().sum()
Out[9]:
hotel                                  0
is_canceled                            0
lead_time                              0
arrival_date_year                      0
arrival_date_month                     0
arrival_date_week_number               0
arrival_date_day_of_month              0
stays_in_weekend_nights                0
stays_in_week_nights                   0
adults                                 0
children                               4
babies                                 0
meal                                   0
country                              488
market_segment                         0
distribution_channel                   0
is_repeated_guest                      0
previous_cancellations                 0
previous_bookings_not_canceled         0
reserved_room_type                     0
assigned_room_type                     0
booking_changes                        0
deposit_type                           0
agent                              16340
company                           112593
days_in_waiting_list                   0
customer_type                          0
adr                                    0
required_car_parking_spaces            0
total_of_special_requests              0
reservation_status                     0
reservation_status_date                0
dtype: int64

Czym możemy zastąpić te NULLe?

  • brak informacji o liczbie dzieci - 0 (w szczególnoÅ›ci, że to tylko 4 rezerwacje),
  • brak informacji o kraju - "Unknown",
  • brak informacji o agencji - 0 (ID, nieużyte),
  • brak informacji o firmie - 0 (ID, nieużyte)
In [10]:
# Czy bezpośrednie rezerwacje to te, które mają NULL agenta? 
df[(df["distribution_channel"] == "Direct") & (df["agent"].isna())].shape[0]
# Nie wszystkie :(
Out[10]:
7625
In [11]:
# Okej, to trochę dziwne (?), ale część bezpośrednich rezerwacji było dokonywanych przez jakichś określonych agentów???
df[df["distribution_channel"] == "Direct"]["agent"].value_counts()
Out[11]:
14.0     3633
250.0    2827
273.0     249
305.0      45
240.0      43
68.0       40
45.0       30
385.0      30
38.0       25
9.0        17
288.0      14
126.0       7
314.0       4
242.0       4
47.0        3
315.0       3
6.0         3
146.0       3
241.0       3
5.0         3
95.0        3
85.0        3
34.0        2
10.0        2
71.0        2
8.0         1
110.0       1
154.0       1
64.0        1
2.0         1
405.0       1
185.0       1
127.0       1
303.0       1
40.0        1
531.0       1
115.0       1
195.0       1
42.0        1
307.0       1
134.0       1
191.0       1
394.0       1
187.0       1
171.0       1
223.0       1
334.0       1
Name: agent, dtype: int64
In [12]:
# 0 sÄ… wolnymi ID w obu przypadkach
(df["agent"] == 0).sum(), (df["company"] == 0).sum()
Out[12]:
(0, 0)
In [13]:
nan_replacements = {"children:": 0, "country": "Unknown", "agent": 0, "company": 0}
hotels_df = df.fillna(nan_replacements)

EDA

In [14]:
hotels_df
Out[14]:
hotel is_canceled lead_time arrival_date_year arrival_date_month arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights adults ... deposit_type agent company days_in_waiting_list customer_type adr required_car_parking_spaces total_of_special_requests reservation_status reservation_status_date
0 Resort Hotel 0 342 2015 July 27 1 0 0 2 ... No Deposit 0.0 0.0 0 Transient 0.00 0 0 Check-Out 2015-07-01
1 Resort Hotel 0 737 2015 July 27 1 0 0 2 ... No Deposit 0.0 0.0 0 Transient 0.00 0 0 Check-Out 2015-07-01
2 Resort Hotel 0 7 2015 July 27 1 0 1 1 ... No Deposit 0.0 0.0 0 Transient 75.00 0 0 Check-Out 2015-07-02
3 Resort Hotel 0 13 2015 July 27 1 0 1 1 ... No Deposit 304.0 0.0 0 Transient 75.00 0 0 Check-Out 2015-07-02
4 Resort Hotel 0 14 2015 July 27 1 0 2 2 ... No Deposit 240.0 0.0 0 Transient 98.00 0 1 Check-Out 2015-07-03
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
119385 City Hotel 0 23 2017 August 35 30 2 5 2 ... No Deposit 394.0 0.0 0 Transient 96.14 0 0 Check-Out 2017-09-06
119386 City Hotel 0 102 2017 August 35 31 2 5 3 ... No Deposit 9.0 0.0 0 Transient 225.43 0 2 Check-Out 2017-09-07
119387 City Hotel 0 34 2017 August 35 31 2 5 2 ... No Deposit 9.0 0.0 0 Transient 157.71 0 4 Check-Out 2017-09-07
119388 City Hotel 0 109 2017 August 35 31 2 5 2 ... No Deposit 89.0 0.0 0 Transient 104.40 0 0 Check-Out 2017-09-07
119389 City Hotel 0 205 2017 August 35 29 2 7 2 ... No Deposit 9.0 0.0 0 Transient 151.20 0 2 Check-Out 2017-09-07

119390 rows × 32 columns

In [121]:
hotels_df[['lead_time', 'adr']].hist(figsize=(15,6), bins=80)
hotels_df[['stays_in_weekend_nights', 'stays_in_week_nights']].hist(figsize=(15,6), bins=10, range=(0,9))
hotels_df[['adults', 'children', 'babies']].hist(figsize=(15,12), bins=6, range=(0,5))
plt.show()
In [16]:
num_features = ["is_canceled", "lead_time","arrival_date_week_number","arrival_date_day_of_month",
                "stays_in_weekend_nights","stays_in_week_nights","adults","children",
                "babies","is_repeated_guest", "previous_cancellations", "booking_changes",
                "previous_bookings_not_canceled", "required_car_parking_spaces", 
                "total_of_special_requests", "adr"]

plt.figure(figsize=(12, 10))
heatmap = sns.heatmap(hotels_df[num_features].corr(), annot=True, vmin=-1, vmax=1, cmap="BrBG")
plt.show()

Jak często odwoływane są rezerwacje?

Skoro mamy zajmować się przewidywaniem, czy dana rezerwacja zostanie odwołana, zobaczmy jak częsty jest to problem...

In [47]:
plt.figure(figsize=(12,8))
ax = sns.countplot(data = hotels_df, x = "hotel", hue = "is_canceled")
for p in ax.patches:
        ax.annotate('{}'.format(p.get_height()), (p.get_x()+0.1, p.get_height()+100))

plt.title("Number of canceled and non canceled bookings", fontsize=16)
plt.xlabel("Hotel type", fontsize=16)
plt.ylabel("Number of bookings", fontsize=16)
plt.legend(title = "Booking status", labels = ["not canceled", "canceled"])
plt.show()

Zaskakująco często.

Wpływ typu depozytu na anulowanie rezerwacji

In [137]:
deposit_cancel_data = hotels_df.groupby("deposit_type")["is_canceled"].describe()

plt.figure(figsize=(12, 8))
sns.barplot(x=deposit_cancel_data.index, y=deposit_cancel_data["mean"] * 100, color = "steelblue")
plt.title("Effect of deposit_type on cancelation", fontsize=16)
plt.xlabel("Deposit type", fontsize=16)
plt.ylabel("Cancelations [%]", fontsize=16)
plt.show()
In [122]:
hotels_df["deposit_type"].value_counts()
Out[122]:
No Deposit    104641
Non Refund     14587
Refundable       162
Name: deposit_type, dtype: int64

"Indication on if the customer made a deposit to guarantee the booking.

This variable can assume three categories:

  • No Deposit – no deposit was made;
  • Non Refund – a deposit was made in the value of the total stay cost;
  • Refundable – a deposit was made with a value under the total cost of stay."
In [19]:
hotels_df[hotels_df["deposit_type"] == "Non Refund"].groupby(['hotel', 'is_canceled']).size().reset_index()
Out[19]:
hotel is_canceled 0
0 City Hotel 0 24
1 City Hotel 1 12844
2 Resort Hotel 0 69
3 Resort Hotel 1 1650

Jak zmienia się liczba rezerwacji w ciągu roku?

In [20]:
bookings_monthly = hotels_df[["hotel", "arrival_date_month", "arrival_date_year", "is_canceled", "adr"]]
ordered_months = ["January", "February", "March", "April", "May", "June", 
          "July", "August", "September", "October", "November", "December"]
bookings_monthly.loc[:,"arrival_date_month"] = pd.Categorical(bookings_monthly["arrival_date_month"], categories=ordered_months, ordered=True)
In [21]:
bookings_monthly = bookings_monthly.groupby(["hotel", "arrival_date_month", "is_canceled"]).size().reset_index(name='counts')
#w ramce dane za lipiec i sierpień występują 3 razy, za pozostałe miesiące - 2 razy
bookings_monthly.loc[(bookings_monthly["arrival_date_month"] == "July") | (bookings_monthly["arrival_date_month"]  == "August"),
                    "counts"] /= 3
bookings_monthly.loc[~((bookings_monthly["arrival_date_month"] == "July") | (bookings_monthly["arrival_date_month"]  == "August")),
                    "counts"] /= 2
In [22]:
plt.figure(figsize=(12, 8))
plt.ylim(0, 2500)
sns.lineplot(data=bookings_monthly[bookings_monthly["hotel"] == "City Hotel"], x = "arrival_date_month", y="counts", hue="is_canceled")
plt.legend(title = "Booking status", labels = ["not canceled", "canceled"])
plt.title("Average number of bookings in City Hotel over the year", fontsize=16)
plt.xlabel("Arrival date month", fontsize=16)
plt.ylabel("Number of bookings", fontsize=16)
plt.show()
In [23]:
plt.figure(figsize=(12, 8))
plt.ylim(0, 1500)
sns.lineplot(data=bookings_monthly[bookings_monthly["hotel"] == "Resort Hotel"], x = "arrival_date_month", y="counts", hue="is_canceled")
plt.legend(title = "Booking status", labels = ["not canceled", "canceled"])
plt.title("Average number of bookings in Resort Hotel over the year", fontsize=16)
plt.xlabel("Arrival date month", fontsize=16)
plt.ylabel("Number of bookings", fontsize=16)
plt.show()
In [19]:
df.country.value_counts()
Out[19]:
PRT    48590
GBR    12129
FRA    10415
ESP     8568
DEU     7287
       ...  
SLE        1
NPL        1
FJI        1
VGB        1
BDI        1
Name: country, Length: 177, dtype: int64

Kraj pochodzenia gości hotelowych

In [24]:
plt.figure(figsize=(12, 8))
sns.countplot(data = hotels_df[hotels_df["hotel"] == "City Hotel"], x = "country", hue = "is_canceled", 
              order = pd.value_counts(hotels_df['country']).iloc[:15].index)
plt.legend(title = "Booking status", labels = ["not canceled", "canceled"])
plt.title("Number of bookings in City Hotel by country of origin of guests", fontsize=16)
plt.xlabel("Country", fontsize=16)
plt.ylabel("Number of bookings", fontsize=16)
plt.show()
In [25]:
plt.figure(figsize=(12, 8))
sns.countplot(data = hotels_df[hotels_df["hotel"] == "Resort Hotel"], x = "country", hue = "is_canceled", 
              order = pd.value_counts(hotels_df['country']).iloc[:15].index)
plt.legend(title = "Booking status", labels = ["not canceled", "canceled"])
plt.title("Number of bookings in Resort Hotel by country of origin of guests", fontsize=16)
plt.xlabel("Country", fontsize=16)
plt.ylabel("Number of bookings", fontsize=16)
plt.show()
In [26]:
country_data = pd.DataFrame(hotels_df.loc[hotels_df["is_canceled"] == 0]["country"].value_counts())
country_data.rename(columns={"country": "Number of Guests"}, inplace=True)
total_guests = country_data["Number of Guests"].sum()
country_data["Guests in %"] = round(country_data["Number of Guests"] / total_guests * 100, 2)
country_data["country"] = country_data.index
guest_map = px.choropleth(country_data,
                    locations=country_data.index,
                    color=country_data["Guests in %"], 
                    hover_name=country_data.index, 
                    color_continuous_scale=px.colors.sequential.Plasma,
                    title="Origin country of guests")
guest_map.show()
In [117]:
leadtime = hotels_df.copy()
leadtime["bin"] = pd.qcut(leadtime["lead_time"],  q = np.arange(0.0, 1.0, 0.1))
leadtime_plot = leadtime.groupby("bin")["is_canceled"].mean() * 100
leadtime_plot = leadtime_plot.reset_index()

plt.figure(figsize=(16, 8))
sns.barplot(x = "bin", y="is_canceled", data=leadtime_plot, color = "steelblue")
plt.title("Effect of lead time on cancelations", fontsize=16)
plt.xlabel("Lead time bin", fontsize=16)
plt.ylabel("Cancelations [%]", fontsize=16)
plt.show()
In [127]:
plt.figure(figsize=(6, 8))
sns.boxplot(data = hotels_df, x = "is_canceled", y = "lead_time")
plt.show()

Długość pobytu a liczba rezerwacji

In [27]:
res = hotels_df.loc[(hotels_df["hotel"] == "Resort Hotel")]
city = hotels_df.loc[(hotels_df["hotel"] == "City Hotel")]
In [28]:
res["total_nights"] = res["stays_in_weekend_nights"] + res["stays_in_week_nights"]
city["total_nights"] = city["stays_in_weekend_nights"] + city["stays_in_week_nights"]
res_plot = res.groupby(['total_nights', "is_canceled"]).size().reset_index()
city_plot = city.groupby(['total_nights', "is_canceled"]).size().reset_index()
res_plot
Out[28]:
total_nights is_canceled 0
0 0 0 372
1 0 1 12
2 1 0 6580
3 1 1 1168
4 2 0 4488
... ... ... ...
62 46 0 1
63 56 0 1
64 56 1 1
65 60 0 1
66 69 0 1

67 rows × 3 columns

In [29]:
plt.figure(figsize=(16, 8))
sns.barplot(x = "total_nights", y = 0, hue="is_canceled", data=city_plot)
plt.title("Bookings and length of stay in City Hotel", fontsize=16)
plt.xlabel("Number of nights", fontsize=16)
plt.ylabel("Number of bookings", fontsize=16)
plt.xlim(0,22)
plt.show()
In [30]:
plt.figure(figsize=(16, 8))
sns.barplot(x = "total_nights", y = 0, hue="is_canceled", data=res_plot)
plt.xlabel("Number of nights", fontsize=16)
plt.ylabel("Guests [%]", fontsize=16)
plt.xlim(0,22)
plt.title("Bookings and length of stay in Resort Hotel", fontsize=16)
plt.xlabel("Number of nights", fontsize=16)
plt.ylabel("Number of bookings", fontsize=16)
plt.show()

Sposób rezerwacji a liczba rezerwacji

In [30]:
segments=hotels_df.groupby(["market_segment", "is_canceled"]).size().reset_index()
segments.sort_values(by = [0], ascending = False, inplace = True)
plt.figure(figsize=(16, 8))
sns.barplot(x = "market_segment", y = 0, hue="is_canceled", data=segments)
plt.title("Bookings by market segments", fontsize=16)
plt.xlabel("Market segment", fontsize=16)
plt.ylabel("Number of bookings", fontsize=16)
plt.legend(loc="upper right", title="is canceled")
plt.show()
In [32]:
segments_pct = segments.pivot_table(index = "market_segment", columns = "is_canceled", values = 0)
segments_pct["pct_of_cancelations"] = segments_pct[1] / (segments_pct[1] + segments_pct[0])
segments_pct.sort_values(by = "pct_of_cancelations")
Out[32]:
is_canceled 0 1 pct_of_cancelations
market_segment
Complementary 646.0 97.0 0.130552
Direct 10672.0 1934.0 0.153419
Corporate 4303.0 992.0 0.187347
Aviation 185.0 52.0 0.219409
Offline TA/TO 15908.0 8311.0 0.343160
Online TA 35738.0 20739.0 0.367211
Groups 7714.0 12097.0 0.610620
Undefined NaN 2.0 NaN

Widzimy chociażby, że rzadziej odwoływane są rezerwacje robione osobiście, bez pośrednictwa. Bardzo często natomiast anulowane są rezerwacje grupowe

In [33]:
plt.figure(figsize=(12, 8))
sns.countplot(data = hotels_df, x = "agent", hue = "is_canceled", 
              order = pd.value_counts(hotels_df['agent']).iloc[:15].index)
plt.title("Bookings by agents", fontsize=16)
plt.xlabel("Agent ID", fontsize=16)
plt.ylabel("Number of bookings", fontsize=16)
plt.legend(title = "Booking status", labels = ["not canceled", "canceled"])
plt.show()
In [34]:
agents_bookings = hotels_df.groupby("agent")["is_canceled"].describe()
agents_bookings[["count", "mean"]].sort_values("count", ascending = False).head(20).sort_values("mean", ascending = False)
Out[34]:
count mean
agent
19.0 1061.0 0.735156
1.0 7191.0 0.734251
229.0 786.0 0.615776
37.0 1230.0 0.582927
21.0 875.0 0.578286
3.0 1336.0 0.577096
9.0 31961.0 0.415006
240.0 13922.0 0.393909
242.0 780.0 0.333333
6.0 3290.0 0.311550
8.0 1514.0 0.282034
83.0 696.0 0.271552
0.0 16340.0 0.246634
14.0 3640.0 0.179121
250.0 2870.0 0.178746
314.0 927.0 0.175836
241.0 1721.0 0.137130
7.0 3539.0 0.133936
40.0 1039.0 0.080847
28.0 1666.0 0.066026

ADR a liczba rezerwacji

In [136]:
plt.figure(figsize=(8, 8))
sns.violinplot(data = hotels_df, x = "is_canceled", y = "adr")
plt.ylim([-100,1000])
plt.show()
In [95]:
city = hotels_df[hotels_df["hotel"] == "City Hotel"]
resort = hotels_df[hotels_df["hotel"] == "Resort Hotel"]
In [96]:
city["bin"] = pd.qcut(city["adr"],  q = np.arange(0.0, 1.1, 0.1))
resort["bin"] = pd.qcut(resort["adr"],  q = np.arange(0.0, 1.1, 0.1))
In [97]:
city = city.groupby(["bin", "is_canceled"])["hotel"].count().reset_index(name = "counts")
resort = resort.groupby(["bin", "is_canceled"])["hotel"].count().reset_index(name = "counts")
In [98]:
city2 = city.pivot_table(index = "bin", columns = "is_canceled", values = "counts").reset_index()
city2["pct_canceled"] = city2[1] / (city2[0] + city2[1])
resort2 = resort.pivot_table(index = "bin", columns = "is_canceled", values = "counts").reset_index()
resort2["pct_canceled"] = resort2[1] / (resort2[0] + resort2[1])
In [99]:
plt.figure(figsize=(16, 8))
sns.barplot(x = "bin", y = "counts", hue="is_canceled", data=city)
plt.title("Bookings in City Hotel by ADR", fontsize=16)
plt.xlabel("ADR", fontsize=16)
plt.ylabel("Number of bookings", fontsize=16)
plt.legend(loc="upper right", title="is canceled")
plt.show()
In [100]:
city2
Out[100]:
is_canceled bin 0 1 pct_canceled
0 (-0.001, 62.8] 3344 4779 0.588329
1 (62.8, 75.0] 6048 3300 0.353017
2 (75.0, 84.0] 4013 2350 0.369323
3 (84.0, 90.67] 4576 3329 0.421126
4 (90.67, 99.9] 5155 2838 0.355061
5 (99.9, 109.0] 4457 3566 0.444472
6 (109.0, 120.0] 5151 3852 0.427857
7 (120.0, 132.3] 3924 2871 0.422517
8 (132.3, 156.303] 4804 3040 0.387557
9 (156.303, 5400.0] 4756 3177 0.400479
In [101]:
plt.figure(figsize=(16, 8))
sns.barplot(x = "bin", y = "counts", hue="is_canceled", data=resort)
plt.title("Bookings in Resort Hotel by ADR", fontsize=16)
plt.xlabel("ADR", fontsize=16)
plt.ylabel("Number of bookings", fontsize=16)
plt.legend(loc="upper right", title="is canceled")
plt.show()
In [102]:
resort2
Out[102]:
is_canceled bin 0 1 pct_canceled
0 (-6.381, 37.8] 3279 918 0.218728
1 (37.8, 47.0] 3153 822 0.206792
2 (47.0, 55.074] 2926 920 0.239210
3 (55.074, 65.132] 3072 934 0.233150
4 (65.132, 75.0] 2880 1147 0.284827
5 (75.0, 88.4] 2711 1279 0.320551
6 (88.4, 110.7] 2966 1067 0.264567
7 (110.7, 142.0] 2810 1186 0.296797
8 (142.0, 189.0] 2701 1292 0.323566
9 (189.0, 508.0] 2440 1557 0.389542

Czy przypisanie do innego typu pokoju ma znaczący wpływ na anulowanie rezerwacji?

In [40]:
hotels_df.loc[(hotels_df["reserved_room_type"] != hotels_df["assigned_room_type"]), "is_canceled"].mean()
Out[40]:
0.05376416169471073
In [41]:
hotels_df.loc[(hotels_df["reserved_room_type"] == hotels_df["assigned_room_type"]), "is_canceled"].mean()
Out[41]:
0.41562891847654415

Raczej nie bardzo

Czy wyjazdy z dziećmi są częściej odwoływane?

In [42]:
hotels_df.loc[hotels_df["babies"] + hotels_df["children"] > 0, "is_canceled"].mean()
Out[42]:
0.3492284612087441
In [43]:
hotels_df.loc[hotels_df["babies"] + hotels_df["children"] == 0, "is_canceled"].mean()
Out[43]:
0.3721900158104203

Też nie

Czy ponowni goście są mniej skłonni do odwoływania rezerwacji?

In [44]:
hotels_df.loc[hotels_df["is_repeated_guest"] == 1, "is_canceled"].mean()
Out[44]:
0.14488188976377953
In [45]:
hotels_df.loc[hotels_df["is_repeated_guest"] == 0, "is_canceled"].mean()
Out[45]:
0.3778508392455442

TAK

In [ ]: